De novo transcriptome assembly and genome annotation of the fat-tailed dunnart (Sminthopsis crassicaudata)

Marsupials exhibit distinctive modes of reproduction and early development that set them apart from their eutherian counterparts and render them invaluable for comparative studies. However, marsupial genomic resources still lag far behind those of eutherian mammals. We present a series of novel genomic resources for the fat-tailed dunnart (Sminthopsis crassicaudata), a mouse-like marsupial that, due to its ease of husbandry and ex-utero development, is emerging as a laboratory model. We constructed a highly representative multi-tissue de novo transcriptome assembly of dunnart RNA-seq reads spanning 12 tissues. The transcriptome includes 2,093,982 assembled transcripts and has a mammalian transcriptome BUSCO completeness score of 93.3%, the highest amongst currently published marsupial transcriptomes. This global transcriptome, along with ab initio predictions, supported annotation of the existing dunnart genome, revealing 21,622 protein-coding genes. Altogether, these resources will enable wider use of the dunnart as a model marsupial and deepen our understanding of mammalian genome evolution.

Recently, the fat-tailed dunnart (Sminthopsis crassicaudata, NCBI:txid9301) has emerged as a key laboratory marsupial model for understanding mammalian development and evolution [42, 57–61]. A nocturnal species belonging to the family Dasyuridae, the fat-tailed dunnart has adapted to a wide range of habitats and can be found across south and central mainland Australia [62] (Figure 1A and B). As one of the smallest carnivorous marsupials, adults weigh an average of 15 grams. Fat-tailed dunnarts exhibit some of the shortest known gestation times for mammals (13 days), with much of their development occurring postnatally. Fat-tailed dunnart neonates reside in their mother's pouch, thereby allowing continuous and non-invasive experimental access [63, 64]. The extremely altricial state of the dunnart young, along with very simple husbandry requirements, has facilitated the dunnart's role as a model species for comparative mammalian studies and conservation strategies.
However, the paucity of genomic resources for the fat-tailed dunnart has limited our understanding of this species at the gene level. As such, high-quality genome assembly and genome annotation have become increasingly important for investigations into the dunnart's unique biology. Recently, a draft fat-tailed dunnart genome assembly was released based on sequence data comprising ONT and PacBio long reads as well as Illumina HiSeq short reads [65]. While this scaffold-level assembly is a significant resource, an improved workflow was necessary in order to increase the genome's contiguity and completeness. Moreover, due to the absence of a de novo transcriptome, gene annotations had to be lifted over from the Tasmanian devil (Sarcophilus harrisii, GCF_902635505.1 - mSarHar1.11) to the dunnart scaffolds, thereby producing an incomplete representation of dunnart gene structure.
To address this knowledge gap, we present a comprehensive de novo transcriptome built from RNA-seq data from 24 samples, spanning 12 tissues. This global transcriptome has recovered 93.3% of complete mammalian BUSCOs, indicating its functional completeness.
We also report the very first fat-tailed dunnart genome annotation. The genome annotation effort, made possible through the multi-tissue transcriptome assembly and ab initio predictions, yielded 21,622 protein-coding genes. Additionally, we provide an improved genome assembly that is 3.23 Gb in size with a scaffold N50 of 72.64 Mb. Annotated genomes and global transcriptomes are of paramount importance for attaching biological meaning to sequencing data. As such, this first-draft annotation and global transcriptome can serve as tools with which the genomic architecture of the fat-tailed dunnart, an emerging marsupial model species, can be better understood. Most importantly, these comprehensive resources contribute to the growing body of research on marsupial genomics and are therefore invaluable tools for future mammalian studies.

Figure 1. […] The fat-tailed dunnart's range across Australia (CC BY) [66]. (C) Phylogeny of extant marsupial orders (based on [67] and [68]). The fat-tailed dunnart (blue font) is a member of the order Dasyuromorphia.

RNA samples were pooled in approximately equal proportions for Iso-Seq, namely, allantois, amnion, distal and proximal yolk sacs, endometrium, oviduct, ovary, testis, liver, eye, gastrula-stage conceptus, and late fetus. All RNA samples were extracted using Qiagen RNeasy Mini or Micro kits according to the manufacturer's instructions, with Illumina and Iso-Seq library construction and sequencing outsourced to Azenta Life Sciences (USA). For Illumina sequencing, this included rRNA depletion and strand-specific RNA library preparation, multiplexing, and sequencing on the NovaSeq platform, in a 2 × 150-bp (paired-end) configuration for 23 samples. Iso-Seq (poly-A selected and strand-specific) was performed using a PacBio Sequel II platform (1 sample, mean length of 5,400 bp). RNA Integrity Numbers (RIN) were generated using Bioanalyzer, and are available through Figshare [77].
To generate a global dunnart transcriptome, the trimmed, paired-end RNA-seq reads were used as input to Trinity v2.13.2 (RRID:SCR_013048) [79]. We applied default in silico read normalization and set the minimum reported contig length to 200 bp. Circular consensus reads were incorporated for Iso-Seq long-read correction (parameter: --long_reads). Contig assembly was executed using three different k-mer settings: 25, 29, and 32. We chose these values because 25 and 32 are the minimum and maximum permitted values for the Trinity contig assembly step. Assembly statistics were obtained using the Trinity script TrinityStats.pl [79]. A reference-free evaluation of assembly quality was conducted using RSEM-EVAL, a component package of Detonate v1.11 [80]. RSEM-EVAL provides a weighted quality score using a probabilistic model. Although these scores are always negative, when comparing two assemblies, a higher value represents a higher-quality assembly. The completeness of the full-length assemblies was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) [76]. The BUSCO gene sets comprise nearly universally distributed single-copy orthologous genes representing various phylogenetic levels. Here, BUSCO v5.2.2 assessment was carried out in transcriptome mode using the Mammalia_odb v10 database of orthologs.
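The three Trinity runs described above can be sketched as a small command builder. This is an illustration under stated assumptions, not the authors' actual invocation: the input file names, memory limit, CPU count, and output directory names are hypothetical placeholders, and only the options named in the text (k-mer size, minimum contig length, long reads) are tied to the described method.

```python
from typing import Dict, List


def trinity_command(left: str, right: str, long_reads: str, kmer_size: int,
                    out_dir: str, max_memory: str = "100G",
                    cpu: int = 16) -> List[str]:
    """Build (but do not run) a Trinity argument list for one k-mer setting.

    File names, max_memory and cpu are hypothetical placeholders.
    """
    # The text gives 25 and 32 as the permitted bounds for contig assembly.
    if not 25 <= kmer_size <= 32:
        raise ValueError("k-mer size must lie between 25 and 32")
    return [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--long_reads", long_reads,        # Iso-Seq circular consensus reads
        "--KMER_SIZE", str(kmer_size),
        "--min_contig_length", "200",
        "--max_memory", max_memory,
        "--CPU", str(cpu),
        "--output", out_dir,
    ]


# One command per k-mer setting considered in the text.
commands: Dict[int, List[str]] = {
    k: trinity_command("reads_R1.fq.gz", "reads_R2.fq.gz", "ccs_reads.fasta",
                       k, f"trinity_k{k}")
    for k in (25, 29, 32)
}

for k, cmd in commands.items():
    print(k, " ".join(cmd))
```

On a machine with Trinity installed, each argument list could be handed to `subprocess.run(cmd, check=True)`; building the lists separately keeps the three runs identical apart from the k-mer size and output directory.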
To quantify the RNA-seq read representation of the assembly, all reads were mapped back to the global transcriptome assembly using Bowtie2 v2.4.5 (RRID:SCR_005476) [81], setting a maximum of 20 distinct alignments for each read (parameter: -k 20). Transcript abundance was quantified using RSEM v1.3.3 [82], with Bowtie2 read alignments. Prior to annotation, transcript redundancy in the global transcriptome was reduced using CD-HIT v4.8.1 [83] with a homology threshold of 1 (parameter: -c 1) to avoid filtering out true isoforms.
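To see why a threshold of 1 preserves true isoforms, consider a simplified stand-in for the redundancy-reduction step. The sketch below is not CD-HIT's algorithm: it assumes that, at 100% identity, a shorter sequence is redundant only when it occurs verbatim inside a longer retained sequence, it ignores reverse-complement matches, and it uses a quadratic scan that would not scale to millions of transcripts. The sequence names and toy sequences are hypothetical.

```python
from typing import Dict


def collapse_identical(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep one representative per group of exactly contained sequences.

    Sequences are visited longest first; a sequence is dropped only if it is
    an exact substring of an already retained representative.
    """
    kept: Dict[str, str] = {}
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    for name, seq in ordered:
        if any(seq in representative for representative in kept.values()):
            continue  # 100% identical over its full length: redundant
        kept[name] = seq
    return kept


# Hypothetical toy transcripts: one exact copy, one contained fragment,
# and one isoform that differs by a single base.
transcripts = {
    "iso1": "ATGGCGTACGTTAGC",
    "iso1_copy": "ATGGCGTACGTTAGC",
    "iso1_fragment": "GCGTACG",
    "iso2": "ATGGCGTTCGTTAGC",
}

nonredundant = collapse_identical(transcripts)
print(sorted(nonredundant))
```

The exact copy and the contained fragment are removed, while the isoform with a single-base difference survives; any threshold below 1 would risk merging it with `iso1`.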

RESULTS
To generate a genome-level annotation for the fat-tailed dunnart, we began by producing an improved draft genome assembly. We employed a hybrid approach, which integrated the […] (Table 1).
A de novo reconstruction of the dunnart transcriptome was conducted using a set of 24 RNA-seq samples originating from the liver, testis, prostate, ovary, oviduct, uterus, eye, whole neonate, allantois, amnion, distal yolk sac, proximal yolk sac, and endometrium. To ensure that the most representative assembly was obtained, we sought to identify the optimal k-mer length for the Trinity contig assembly step, considering k values of 25, 29, and 32 (Table 2). Given that reference-free transcriptome assembly relies on grouping overlapping sequences of read fragments of a predetermined size (i.e., the k-mer), identifying the optimal fragment size might yield a more accurate assembly. To assess this fragment size effect, we computed multiple assembly quality metrics, including the BUSCO completeness score (transcriptome mode) and the Detonate RSEM-EVAL score for each Trinity run. The RSEM-EVAL score represents the sum of three main factors: likelihood estimates of the read representation within the assembly, the assembly prior, which assumes that each contig is generated independently, and the BIC (Bayesian Information Criterion) penalty [80]. When comparing two assemblies, a higher RSEM-EVAL score is indicative of a more complete transcriptome assembly. In our comparison, the Trinity run with a k-mer setting of 29 produced the top-scoring assembly; thus, all subsequent analysis was carried out using this assembly.
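The two ideas in this paragraph, splitting reads into overlapping k-mers and picking the run with the highest (least negative) RSEM-EVAL score, can be illustrated with a minimal sketch. The read is synthetic and the scores are made-up placeholders chosen only to show the comparison; they are not the values in Table 2.

```python
from typing import Dict, List


def kmers(read: str, k: int) -> List[str]:
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def best_assembly(scores: Dict[int, float]) -> int:
    """Return the k-mer setting whose score is highest.

    RSEM-EVAL scores are negative, so the highest score is the one
    closest to zero.
    """
    return max(scores, key=scores.get)


# A synthetic 150-bp read, matching the 2 x 150-bp configuration.
read = "ACGTACGTAC" * 15

# A read of length L yields L - k + 1 overlapping k-mers.
kmer_counts = {k: len(kmers(read, k)) for k in (25, 29, 32)}
print(kmer_counts)

# Placeholder scores, NOT taken from the paper.
placeholder_scores = {25: -3.20e9, 29: -3.10e9, 32: -3.15e9}
print(best_assembly(placeholder_scores))
```

Larger k gives fewer, more specific k-mers per read, which is the trade-off the k-mer comparison is probing.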
This transcriptome assembly was composed of 2,093,982 assembled transcripts (including splicing isoforms), with a GC content of 40.2% and a mean transcript length of 830 bp (Table 2). The transcript N50 was 1,489 bp, and considering only the top 90% most highly expressed transcripts (a more accurate proxy for transcriptome quality [104]) produced an E90N50 of 3,430 bp. Sample reads that were mapped back to the assembly had a very high overall alignment rate (98%), with a high percentage mapped as proper pairs (94%). In addition, the global transcriptome had a 93.3% recovery of complete mammalian BUSCOs (Mammalia_odb v10 [76]). These values are in line with, or higher than, those reported from all other available marsupial transcriptome datasets (Table 3). Specifically, […] transcripts with an N50 of 687 bp and a 95% alignment rate of sample reads to the assembly [105].
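For readers unfamiliar with these contiguity metrics, the sketch below shows one common way to compute N50 and an expression-filtered N50. It assumes the usual definitions: N50 is the length at which the cumulative sum of lengths, sorted longest first, reaches half the total, and ExN50 restricts that calculation to the most highly expressed transcripts that together account for x% of total expression. Trinity's own scripts may differ in detail (for example, by grouping at the gene level), and the toy lengths and expression values are hypothetical.

```python
from typing import List, Sequence, Tuple


def n50(lengths: Sequence[int]) -> int:
    """Length L such that sequences of length >= L hold half the total bases."""
    if not lengths:
        raise ValueError("no lengths supplied")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def exn50(transcripts: Sequence[Tuple[int, float]], x: float = 90.0) -> int:
    """N50 over the top-expressed transcripts covering x% of expression.

    Each transcript is a (length, expression) pair.
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    threshold = sum(expr for _, expr in ranked) * x / 100.0
    kept: List[int] = []
    running = 0.0
    for length, expr in ranked:
        kept.append(length)
        running += expr
        if running >= threshold:
            break
    return n50(kept)


# Hypothetical toy assembly: two long, highly expressed transcripts and
# twenty short, weakly expressed ones.
toy = [(3000, 60.0), (2500, 32.0)] + [(500, 0.4)] * 20

print(n50([length for length, _ in toy]))
print(exn50(toy, 90.0))
```

In this toy case the many short, weakly expressed contigs pull the plain N50 down, while the expression-filtered value reflects the well-supported transcripts, which is the same direction of difference seen between the reported N50 and E90N50.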
We used InterProScan to identify conserved domains and assign Gene Ontology (GO) terms.
A total of 24,366 transcripts were assigned InterProScan terms, and 13,507 unique genes were assigned GO terms. The most common GO terms were intracellular anatomical structure (17,995 genes), organelle (17,140 genes), protein binding (17,071 genes), cytoplasm (15,355 genes), and regulation of cellular processes (14,193 genes, Figure 3A). Notably, in another marsupial species, the woylie, cellular processes were also the most common GO term under the Biological Processes (BP) category [107]. Our GO annotations totaled 289,985, with a mean annotation level of 7.15 and a standard deviation of 2.7 (Figure 3B).
Running an HMMER search against the Pfam database yielded 16,308 domains, while dbCAN3 and MEROPS analyses resulted in 212 and 1,053 predictions, respectively.
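Summary statistics such as the mean annotation level and its standard deviation can be derived directly from a per-level histogram of GO annotations. The sketch below assumes a population standard deviation (the text does not say whether a sample or population estimate was used), and the histogram values are hypothetical, not the data behind Figure 3B.

```python
import math
from typing import Dict, Tuple


def level_stats(histogram: Dict[int, int]) -> Tuple[float, float]:
    """Mean and population standard deviation of GO levels.

    `histogram` maps a GO level to the number of annotations at that level.
    """
    n = sum(histogram.values())
    if n == 0:
        raise ValueError("histogram is empty")
    mean = sum(level * count for level, count in histogram.items()) / n
    variance = sum(count * (level - mean) ** 2
                   for level, count in histogram.items()) / n
    return mean, math.sqrt(variance)


# Hypothetical counts of annotations per GO level.
toy_histogram = {5: 2, 7: 4, 9: 2}

mean_level, sd_level = level_stats(toy_histogram)
print(round(mean_level, 2), round(sd_level, 2))
```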

CONCLUSION
The increased availability of genomic resources for marsupial species is critical for fostering a deeper understanding of the evolutionary history of both eutherians and marsupials. In this study, we report an enhanced fat-tailed dunnart genome assembly

Figure 2. Schematic illustrating the de novo transcriptome generation and genome annotation workflow for the fat-tailed dunnart.

Figure 3. Gene ontology (GO) analysis of the fat-tailed dunnart putative genes. (A) GO distribution by category (at level 3) for the fat-tailed dunnart gene set. The ontology categories are BP (Biological Process), MF (Molecular Function), and CC (Cellular Component). The top 20 terms are listed for each category. (B) Distribution of sequence annotations for each GO level.

measuring 3.23 Gb in length. The assembly is organized into 1,848 scaffolds, with a scaffold N50 value of 72.64 Mb. We generated a global de novo transcriptome assembly of the fat-tailed dunnart using RNA-seq short-read and long-read data, which were sampled from a diverse range of dunnart tissues. The transcriptome reconstruction consisted of 2,093,982 assembled transcripts, with a mean transcript length of 830 bp. The transcriptome BUSCO completeness score of 93.3% is the highest among published marsupial transcriptomes (i.e., numbat and brown antechinus). The high overall alignment rate of reads from each of the tissues to the transcriptome (98%) further underscores that the de novo transcriptome is a highly accurate representation of the input reads. The dunnart draft genome annotation revealed 21,622 protein-coding genes, in line with previously reported marsupial gene counts. Overall, these resources provide novel insights into the unique genomic architecture of the fat-tailed dunnart and will therefore serve as valuable tools for future comparative mammalian studies.

Table 1. Fat-tailed dunnart genome assembly statistics compared to the numbat, koala, Tasmanian devil, brown antechinus, tammar wallaby, gray short-tailed opossum, and eastern quoll reference genomes currently available on NCBI.

Columns: Fat-tailed dunnart (this study); Fat-tailed dunnart [65]; Numbat [55]; Koala [52]; Tasmanian devil [43]; Brown antechinus [53]; Tammar wallaby [52]; Gray short-tailed opossum [51]; Eastern quoll [56]
[…] [50] and PacBio long-read data with Illumina paired-end short reads [50]. This resulted in a 3.23 Gb genome that contains 1,848 scaffolds and has a scaffold N50 of 72.64 Mb. The GC content of this draft genome is 36.2% (Table 1). The recovery of complete, single-copy mammalian BUSCOs was 94.2%. Together, these metrics are indicative of a high-quality genome assembly, with marked improvements over the existing dunnart draft genome and notably higher completeness and contiguity compared to other marsupial reference genomes currently available on NCBI (Table 1).

Table 2. Summary of the de novo transcriptome assembly statistics for the Trinity k-mer optimization.

Table 3. Summary of global transcriptomes from marsupial species.

Table 4. Fat-tailed dunnart gene and feature statistics.